ABSTRACT

Cyclotomic fast Fourier transforms (CFFTs) are efficient 
implementations of discrete Fourier transforms over finite 
fields, which have widespread applications in cryptography 
and error control codes. They are of great interest because of 
their low multiplicative and overall complexities. However, 
their advantages are shown by inspection in the literature, 
and there is no asymptotic computational complexity anal- 
ysis for CFFTs. Their high additive complexity also incurs 
difficulties in hardware implementations. In this paper, we 
derive the bounds for the multiplicative and additive com- 
plexities of CFFTs, respectively. Our results confirm that 
CFFTs have the smallest multiplicative complexities among 
all known algorithms while their additive complexities render 
them asymptotically suboptimal. However, CFFTs remain 
valuable as they have the smallest overall complexities for 
most practical lengths. Our additive complexity analysis also 
leads to a structured addition network, which not only has low 
complexity but also is suitable for hardware implementations. 

1. INTRODUCTION 

Discrete Fourier transforms (DFTs) [1] have widespread applications
in error control codes and cryptography, which in turn are important
in almost all digital communication and storage systems. For example,
the syndrome decoders of Reed-Solomon codes [2] require DFTs over
finite fields to implement the syndrome computation and Chien search
efficiently (see, e.g., [3]). Multiplications over GF($p^m$) can also
be implemented efficiently by DFTs via the convolution theorem [1]
when they are formulated as multiplications of polynomials over
GF($p$).

Recently, very long DFTs over finite fields have become necessary in
practice. For example, Reed-Solomon codes over GF($2^{12}$) with
thousands of symbols are considered for hard drive and tape storage as
well as optical communication systems to increase data reliability,
and the syndrome decoders of such codes require DFTs of lengths up to
4095 over GF($2^{12}$). However, direct implementations of DFTs have
complexities that are quadratic in the DFT length, which is
prohibitive for DFTs with thousands of symbols. Therefore we need
low-complexity algorithms and efficient hardware implementations for
DFTs over finite fields.

The cyclotomic fast Fourier transforms (CFFTs), first proposed in [4],
have attracted a lot of attention because of their low multiplicative
and overall complexities. Though these advantages of CFFTs have been
demonstrated for short to moderate lengths in the literature (see,
e.g., [5]), it is unclear whether they still hold for large lengths.
Therefore an asymptotic computational complexity analysis is required
to compare the complexities of CFFTs with those of other existing DFT
algorithms over finite fields [6-8], which can help system designers
find the optimal implementation of very long DFTs.

Another issue regarding CFFTs is their relatively high additive
complexities, which hinder their use. Though the additive complexities
of CFFTs can be reduced by the common subexpression elimination (CSE)
algorithm in [5], the resulting addition network lacks structure,
which makes wiring and module reuse more difficult and introduces
other problems in hardware implementations. Therefore, a structured
method for reducing the additive complexity of CFFTs is desirable.

In this paper, we analyze the asymptotic computational 
complexities of CFFTs and derive bounds on the multiplica- 
tive and additive complexities of CFFTs. The comparisons 
between our results and existing algorithms show that CFFTs 
have the smallest multiplicative complexity, but their high ad- 
ditive complexities render them not asymptotically optimal. 
However, CFFTs are still valuable as they have the small- 
est overall complexities for most DFTs with practical lengths. 
Our additive complexity analysis also leads to a structured ad- 
dition network, which not only has low complexity but also is 
suitable for hardware implementations. 

2. CYCLOTOMIC FAST FOURIER TRANSFORMS 

To make our paper self-contained, we first review CFFTs over
GF($2^m$) briefly in this section. Let $\alpha \in$ GF($2^m$) be an
element of order $n$, where $n \mid 2^m - 1$. Consider an
$n$-dimensional vector $\mathbf{f} = (f_0, f_1, \ldots, f_{n-1})^T$
over GF($2^m$), whose polynomial representation is given by
$f(x) = \sum_{i=0}^{n-1} f_i x^i$. The DFT of $\mathbf{f}$ is
$\mathbf{F} = (F_0, F_1, \ldots, F_{n-1})^T$, where
$F_j = f(\alpha^j)$.
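The definition above can be made concrete with a direct (quadratic) evaluation. The following is only a minimal sketch: it assumes GF($2^4$) built from the primitive polynomial $x^4 + x + 1$ with $\alpha = x$, so $n = 15$, and the names `gf_mul` and `naive_dft` are illustrative rather than taken from the paper.

```python
# Assumption: GF(2^4) is represented with the primitive polynomial
# x^4 + x + 1 (0b10011), and alpha is the element x (integer 2).
M = 4
PRIM_POLY = 0b10011
N = (1 << M) - 1  # order of alpha, here 15

# EXP[p] = alpha^p; the table is doubled so that LOG sums need no reduction.
EXP = [0] * (2 * N)
LOG = [0] * (N + 1)
value = 1
for power in range(N):
    EXP[power] = value
    LOG[value] = power
    value <<= 1
    if value & (1 << M):
        value ^= PRIM_POLY
for power in range(N, 2 * N):
    EXP[power] = EXP[power - N]


def gf_mul(a, b):
    """Multiply two field elements using the log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def naive_dft(f):
    """Return F with F_j = f(alpha^j); field addition is bitwise XOR."""
    assert len(f) == N, "this sketch fixes the length to n = 2^m - 1"
    result = []
    for j in range(N):
        acc = 0
        for i, coeff in enumerate(f):
            acc ^= gf_mul(coeff, EXP[(i * j) % N])
        result.append(acc)
    return result
```

This direct form uses on the order of $n^2$ multiplications and additions, which is the baseline that fast algorithms such as CFFTs aim to improve on.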



We partition the set of integers $\{0, 1, \ldots, n-1\}$ into $k$
cyclotomic cosets modulo $n$ with respect to two as

$$\{0\},\ \{s_1, 2s_1, \ldots, 2^{m_1-1}s_1\},\ \{s_2, 2s_2, \ldots, 2^{m_2-1}s_2\},\ \ldots,\ \{s_{k-1}, 2s_{k-1}, \ldots, 2^{m_{k-1}-1}s_{k-1}\},$$

where $m_i$ is the size of the $i$-th cyclotomic coset, and
$s_i \equiv 2^{m_i}s_i \pmod{n}$ is its representative. Then the
polynomial $f(x)$ can be decomposed as
$f(x) = \sum_{i=0}^{k-1} L_i(x^{s_i})$, where
$L_i(y) = \sum_{j=0}^{m_i-1} f_{s_i 2^j \bmod n}\, y^{2^j}$. The
polynomial $L_i(x)$ is linearized, i.e.,
$L_i(x+y) = L_i(x) + L_i(y)$ for $x, y \in$ GF($2^m$), and this
property is used to reduce the DFT computational complexity in CFFTs.
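The coset partition is easy to compute directly. The sketch below assumes an odd $n$ (so that doubling is a permutation of the residues) and uses the smallest element of each coset as its representative; the function name is illustrative.

```python
def cyclotomic_cosets(n):
    """Partition {0, ..., n-1} into cosets {s, 2s, 4s, ...} modulo n.

    Assumes n is odd, so repeated doubling always returns to s.
    Each coset is listed starting from its smallest element.
    """
    seen = set()
    cosets = []
    for s in range(n):
        if s in seen:
            continue
        coset = []
        x = s
        while x not in seen:
            seen.add(x)
            coset.append(x)
            x = (2 * x) % n
        cosets.append(coset)
    return cosets
```

For $n = 15$ this yields $k = 5$ cosets of sizes 1, 4, 4, 2 and 4, each size dividing $m = 4$.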

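The element $F_j$ in the DFT result $\mathbf{F}$ can be expressed as
$F_j = f(\alpha^j) = \sum_{i=0}^{k-1} L_i(\alpha^{j s_i})$. By the
normal basis theorem [9], there is a normal basis
$\{\gamma_i^{2^0}, \gamma_i^{2^1}, \ldots, \gamma_i^{2^{m_i-1}}\}$ of
GF($2^{m_i}$), such that $\alpha^{j s_i} \in$ GF($2^{m_i}$) can be
represented as $\sum_{s=0}^{m_i-1} a_{i,j,s}\gamma_i^{2^s}$, where
$a_{i,j,s}$ is binary. Therefore

$$F_j = \sum_{i=0}^{k-1}\sum_{s=0}^{m_i-1} a_{i,j,s}\left(\sum_{t=0}^{m_i-1}\gamma_i^{2^{s+t}} f_{s_i 2^t \bmod n}\right).$$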

Writing this in matrix form, the DFT can be computed by
$\mathbf{F} = \mathbf{A}\mathbf{L}\mathbf{f}'$, where $\mathbf{f}'$ is
a rearrangement of $\mathbf{f}$ according to the cyclotomic cosets,
i.e.,
$\mathbf{f}' = (\mathbf{f}_0'^T, \mathbf{f}_1'^T, \ldots, \mathbf{f}_{k-1}'^T)^T$
with
$\mathbf{f}_i' = (f_{s_i}, f_{2s_i}, \ldots, f_{2^{m_i-1}s_i})^T$,
$\mathbf{A}$ is an $n \times n$ binary matrix accumulating the
coefficients $a_{i,j,s}$, and $\mathbf{L}$ is a block diagonal matrix
with sub-matrices $\mathbf{L}_i$ on its diagonal. Block
$\mathbf{L}_i$ is an $m_i \times m_i$ circulant matrix corresponding
to a cyclotomic coset of size $m_i$, and it is generated from a normal
basis
$\{\gamma_i^{2^0}, \gamma_i^{2^1}, \ldots, \gamma_i^{2^{m_i-1}}\}$ of
GF($2^{m_i}$). Therefore, the multiplication between $\mathbf{L}_i$
and $\mathbf{f}_i'$ can be formulated as an $m_i$-point cyclic
convolution between
$\mathbf{h}_i = (\gamma_i^{2^0}, \gamma_i^{2^{m_i-1}}, \gamma_i^{2^{m_i-2}}, \ldots, \gamma_i^{2^1})^T$
and $\mathbf{f}_i'$. Since the matrix $\mathbf{A}$ is binary, the
product between $\mathbf{A}$ and the vector
$\mathbf{v} = \mathbf{L}\mathbf{f}'$ can be computed with additions
only. All the multiplications needed by CFFTs come from the
convolutions between $\mathbf{h}_i$ and $\mathbf{f}_i'$. Because short
convolutions can be computed by efficient bilinear algorithms (see,
e.g., [1]), CFFTs have very low multiplicative complexities. However,
if implemented directly, they have very high additive complexities.

3. COMPUTATIONAL COMPLEXITIES OF CFFTS 
OVER CHARACTERISTIC-2 FIELDS 

In our complexity analysis of CFFTs, we aim to show theoretically
that their multiplicative complexities are the smallest among all
known techniques and to investigate the optimality of the overall
computational complexities of CFFTs. To this end, we focus on CFFTs
of length $n = 2^m - 1$ over GF($2^m$).

We denote the cyclotomic cosets of the set $\{0, 1, \ldots, n-1\}$
modulo $n$ with respect to two as $C_0, C_1, \ldots, C_{k-1}$, and
assume that $C_i$ has $m_i$ elements with a representative $s_i$.
Each $m_i$ necessarily divides $m$, i.e., $m_i \mid m$. We divide the
$C_i$'s into $d$ groups, $G_0, G_1, \ldots, G_{d-1}$, so that the
$C_i$'s in each group are of the same size. We denote the size of
$G_j$ as $|G_j|$.

As described in Sec. 2, an $n$-point CFFT is given by
$\mathbf{A}\mathbf{L}\mathbf{f}'$, where the matrix $\mathbf{A}$ is
binary. The product of the matrix $\mathbf{L}$ and the vector
$\mathbf{f}'$, i.e., the vector $\mathbf{v} = \mathbf{L}\mathbf{f}'$,
is computed via $k$ cyclic convolutions, with
$\mathbf{L}_i\mathbf{f}_i'$ being an $m_i$-point cyclic convolution.
It is a well-known result that an $n$-point cyclic convolution
requires $O(n^{\log_2 3})$ multiplications and $O(n^{\log_2 3})$
additions [10]. The $k$ cyclic convolutions contribute to both the
multiplicative and additive complexities of the CFFT, while computing
$\mathbf{A}\mathbf{v}$ only contributes to the additive complexity
since $\mathbf{A}$ is binary.

3.1. Multiplicative Complexities of CFFTs over GF($2^m$)

By the definition of big O notation, an $m_i$-point cyclic
convolution has a multiplicative complexity less than
$c\,m_i^{\log_2 3}$, where $c$ is a constant independent of $m_i$.
Hence the total multiplicative complexity of an $n$-point CFFT is
less than $c\sum_{i=0}^{k-1} m_i^{\log_2 3}$. As introduced in the
beginning of this section, we can group the cyclotomic cosets
according to their sizes into $d$ groups, and each group $G_j$ has
$|G_j|$ cyclotomic cosets. The size of the cosets in $G_j$, denoted by
$g_j$, divides $m$, i.e., $g_j \mid m$, and also $d < m$. Since
$\log_2 3 > 1$, we have
$\frac{m}{g_j}\,g_j^{\log_2 3} = m\,g_j^{\log_2 3 - 1} \le m\,m^{\log_2 3 - 1} = m^{\log_2 3}$.
Hence, the total multiplicative complexity satisfies

$$c\sum_{i=0}^{k-1} m_i^{\log_2 3} = c\sum_{j=0}^{d-1} |G_j|\,g_j^{\log_2 3}$$
$$= c\sum_{j=0}^{d-1}\left\lfloor\frac{|G_j|}{m/g_j}\right\rfloor\frac{m}{g_j}\,g_j^{\log_2 3} + c\sum_{j=0}^{d-1}\left(|G_j| \bmod \frac{m}{g_j}\right)g_j^{\log_2 3}$$
$$< 2c\,\frac{2^m-1}{m}\,m^{\log_2 3},$$

when $m > 4$, since $d < m < (2^m-1)/m$ in such cases. Since we are
considering the asymptotic complexity, we do not need to consider the
case $m \le 4$. The total multiplicative complexity of an $n$-point
CFFT is thus $O(n(\log_2 n)^{\log_2\frac{3}{2}})$ since
$m = \log_2(n+1)$.
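The inequality above can be spot-checked numerically. The sketch below drops the constant $c$ from both sides and compares $\sum_i m_i^{\log_2 3}$ with $2\,\frac{2^m-1}{m}\,m^{\log_2 3}$; the helper names are illustrative, and the check is only a sanity test for small $m$, not a substitute for the derivation.

```python
from math import log2


def coset_sizes(m):
    """Sizes of the cyclotomic cosets modulo 2^m - 1 with respect to two."""
    n = (1 << m) - 1
    seen = [False] * n
    sizes = []
    for s in range(n):
        if seen[s]:
            continue
        size, x = 0, s
        while not seen[x]:
            seen[x] = True
            size += 1
            x = (2 * x) % n
        sizes.append(size)
    return sizes


def multiplicative_sums(m):
    """Return (sum of m_i^(log2 3), 2 * (2^m - 1) / m * m^(log2 3))."""
    exponent = log2(3)
    n = (1 << m) - 1
    lhs = sum(size ** exponent for size in coset_sizes(m))
    rhs = 2 * n / m * m ** exponent
    return lhs, rhs


if __name__ == "__main__":
    for m in range(5, 13):
        lhs, rhs = multiplicative_sums(m)
        print(m, round(lhs, 1), round(rhs, 1), lhs < rhs)
```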

Unfortunately, this bound on the multiplicative complexities of
CFFTs cannot be generalized to an arbitrary $n$, as counterexamples
show. For instance, for some lengths (say $n = 11$ or $13$), the set
of integers $\{0, 1, \ldots, n-1\}$ is partitioned into only two
cyclotomic cosets, $\{0\}$ and $\{1, 2, \ldots, n-1\}$. Hence, the
total multiplicative complexities of CFFTs of these lengths are on
the order of $O(n^{\log_2 3})$.
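The two counterexample lengths can be checked by computing the multiplicative order of 2 modulo $n$: when that order equals $n - 1$, all nonzero residues fall into a single coset. A minimal sketch, assuming an odd $n \ge 3$ and an illustrative function name:

```python
def order_of_two(n):
    """Multiplicative order of 2 modulo n (assumes n is odd and n >= 3)."""
    order, x = 1, 2 % n
    while x != 1:
        x = (2 * x) % n
        order += 1
    return order


def has_single_nonzero_coset(n):
    """True when {1, ..., n-1} forms one cyclotomic coset modulo n."""
    return order_of_two(n) == n - 1
```

For $n = 11$ and $n = 13$ the order is 10 and 12, respectively, whereas for $n = 15$ it is only 4.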

3.2. Additive Complexities of CFFTs over GF($2^m$)

Both the convolutions and the multiplication between the binary
matrix $\mathbf{A}$ and the vector
$\mathbf{v} = \mathbf{L}\mathbf{f}'$ contribute to the additive
complexity of an $n$-point CFFT over GF($2^m$) with $n = 2^m - 1$.
Since the additive and multiplicative complexities of a cyclic
convolution have the same order, the total additive complexity
contributed by the convolutions is
$O(n(\log_2 n)^{\log_2\frac{3}{2}})$. However, the additive
complexities of CFFTs are dominated by the computation of
$\mathbf{A}\mathbf{v}$. Since $\mathbf{A}$ consists of only 0 and 1,
only additions are needed to compute $\mathbf{A}\mathbf{v}$. We now
derive the additive complexity of $\mathbf{A}\mathbf{v}$.

The Four-Russian algorithm [11] is an efficient algorithm for binary
matrix multiplication, and it requires $O(n^2/\log_2 n)$ additions for
a multiplication between an $n \times n$ matrix and an
$n$-dimensional vector, referred to as an $n \times n$ matrix vector
product (MVP). However, it does not take the structure of the matrix
into account. Next we further reduce the additive complexity of
computing $\mathbf{A}\mathbf{v}$ by exploiting the inner structure of
the matrix $\mathbf{A}$.

As shown in Sec. 2, for an $n$-point CFFT over GF($2^m$) where
$n = 2^m - 1$, the matrix $\mathbf{A}$ can be partitioned into
$1 \times k$ blocks, and each block $\mathbf{A}_i$ is of size
$(2^m - 1) \times m_i$, and its row $j$ is the representation of
$\alpha^{j s_i}$ under a normal basis in the field GF($2^{m_i}$),
where $\alpha$ is an element in GF($2^m$) of order $n$.

We first rearrange the rows of the matrix $\mathbf{A}$ according to
the cyclotomic cosets. The rearrangement results in a new matrix
$\mathbf{A}'$, which can be partitioned into $k \times k$ blocks. Each
block $\mathbf{A}'_{i,j}$ is of size $m_i \times m_j$, and row $t$ in
the block $\mathbf{A}'_{i,j}$ is the representation of
$\alpha^{2^t s_i s_j}$ under a normal basis in GF($2^{m_j}$). By the
property of normal bases, row $t$ is just a right cyclic shift of the
previous row, and hence $\mathbf{A}'_{i,j}$ is a cyclic matrix [12].
We then partition the vector $\mathbf{v}$ into $k$ blocks
correspondingly, and the block $\mathbf{v}_i$ has $m_i$ elements. The
product $\mathbf{A}\mathbf{v}$ can be recovered by reordering the
elements in the vector $\mathbf{A}'\mathbf{v}$.

All those $m_i \times m_j$ blocks can be extended to $m \times m$
matrices while keeping the cyclic property. Since $m_i$ and $m_j$ are
both factors of $m$, we first partition an $m \times m$ matrix into
$\frac{m}{m_i} \times \frac{m}{m_j}$ blocks of size $m_i \times m_j$,
and then set each block to $\mathbf{A}'_{i,j}$. The resulting
$m \times m$ matrix is still a cyclic matrix. After extending all the
blocks to $m \times m$ blocks in this way, we get a $km \times km$
matrix $\mathbf{A}''$. To ensure that we can recover the
multiplication result $\mathbf{A}\mathbf{v}$, we also extend each
sub-vector $\mathbf{v}_i$ to a vector of length $m$ by padding zeros
at the end, resulting in a $km$-dimensional vector $\mathbf{v}''$. The
elements in $\mathbf{A}''\mathbf{v}''$ corresponding to the extended
rows are simply discarded.

To utilize this structure of cyclic sub-matrices, we construct a new
matrix $\mathbf{B}$ and a new vector $\mathbf{u}$ from $\mathbf{A}''$
and $\mathbf{v}''$, respectively, according to the following rules:

$$B_{i_2 k + i_1,\ j_2 k + j_1} = A''_{i_1 m + i_2,\ j_1 m + j_2}, \qquad u_{i_2 k + i_1} = v''_{i_1 m + i_2}, \qquad (1)$$

where $0 \le i_1, j_1 < k$, $0 \le i_2, j_2 < m$, $A''_{i,j}$ and
$B_{i,j}$ are the elements in row $i$ and column $j$ of the matrices
$\mathbf{A}''$ and $\mathbf{B}$, respectively, and $u_i$ and $v''_i$
are the elements at position $i$ in the vectors $\mathbf{u}$ and
$\mathbf{v}''$, respectively. The matrix $\mathbf{B}$ just reorders
the rows and columns of $\mathbf{A}''$, and reordering the vector
$\mathbf{v}''$ into $\mathbf{u}$ ensures that the product
$\mathbf{A}\mathbf{v}$ can be extracted by reordering
$\mathbf{B}\mathbf{u}$ without additional computational complexity.
Since $\mathbf{A}''$ contains $k \times k$ blocks of cyclic matrices
of size $m \times m$, the matrix $\mathbf{B}$ is a block-cyclic matrix
with $m \times m$ block matrices of size $k \times k$.
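The effect of the index mapping in (1) can be checked on a small random instance. The sketch below builds a stand-in for $\mathbf{A}''$ out of random binary circulant blocks (not the actual CFFT matrix), applies (1), and verifies that the result is block-circulant and that the product is preserved up to reordering. All function names are illustrative, and field elements are modeled as integers with XOR as addition.

```python
import random


def circulant(first_row):
    """m x m matrix whose row i is first_row cyclically shifted right by i."""
    m = len(first_row)
    return [[first_row[(j - i) % m] for j in range(m)] for i in range(m)]


def build_extended(k, m, rng):
    """Random km x km binary matrix made of k x k circulant m x m blocks."""
    blocks = [[circulant([rng.randint(0, 1) for _ in range(m)])
               for _ in range(k)] for _ in range(k)]
    return [[blocks[i1][j1][i2][j2] for j1 in range(k) for j2 in range(m)]
            for i1 in range(k) for i2 in range(m)]


def reorder(a2, v2, k, m):
    """Apply rule (1): returns (B, u) built from A'' and v''."""
    size = k * m
    b = [[0] * size for _ in range(size)]
    u = [0] * size
    for i1 in range(k):
        for i2 in range(m):
            u[i2 * k + i1] = v2[i1 * m + i2]
            for j1 in range(k):
                for j2 in range(m):
                    b[i2 * k + i1][j2 * k + j1] = a2[i1 * m + i2][j1 * m + j2]
    return b, u


def xor_matvec(matrix, vector):
    """Binary matrix times vector, with XOR as the field addition."""
    out = []
    for row in matrix:
        acc = 0
        for bit, x in zip(row, vector):
            if bit:
                acc ^= x
        out.append(acc)
    return out


def is_block_circulant(b, k, m):
    """Check that the k x k block at (p, q) depends only on (q - p) mod m."""
    def block(p, q):
        return [row[q * k:(q + 1) * k] for row in b[p * k:(p + 1) * k]]
    return all(block(p, q) == block(0, (q - p) % m)
               for p in range(m) for q in range(m))
```

A typical use is to call `build_extended`, then `reorder`, and compare `xor_matvec` on both forms entry by entry using the same index mapping as in (1).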

Since the result $\mathbf{A}\mathbf{v}$ can be extracted from
$\mathbf{B}\mathbf{u}$ without any additional computational
complexity, the computational complexity of $\mathbf{B}\mathbf{u}$
serves as an upper bound on that of $\mathbf{A}\mathbf{v}$. Now let us
analyze the computational complexity of $\mathbf{B}\mathbf{u}$. The
matrix $\mathbf{B}$ is an $m \times m$ block-cyclic matrix, therefore
$\mathbf{B}\mathbf{u}$ can be computed via $O(m^{\log_2 3})$
multiplications between a $k \times k$ matrix and a $k$-dimensional
vector and $O(m^{\log_2 3})$ additions of two $k$-dimensional vectors
[10]. Since the matrix $\mathbf{B}$ is fixed, all the additions
between $k \times k$ matrices can be precomputed, and they do not
contribute to the additive complexity. Applying the Four-Russian
algorithm, the multiplication between a $k \times k$ matrix and a
$k$-dimensional vector requires $O(k^2/\log_2 k)$ additions. The
addition of two $k$-dimensional vectors requires $k$ additions, and
hence the total computational complexity can be written as

$$O\!\left(m^{\log_2 3}\,\frac{k^2}{\log_2 k}\right) + O\!\left(m^{\log_2 3}\,k\right) = O\!\left(m^{\log_2 3}\,\frac{k^2}{\log_2 k}\right).$$

We need lower and upper bounds on $k$. Before giving these bounds,
let us prove two lemmas.

Lemma 1. Among the cyclotomic cosets of $\{0, 1, \ldots, 2^m - 2\}$
modulo $2^m - 1$ with respect to two, there are at most
$(2^{m_i} - 1)/m_i$ cosets with size $m_i$, where $m_i \mid m$.

Proof. Consider the nonzero elements in the finite field GF($2^m$),
which can be represented as $\alpha^j$, where $\alpha$ is a primitive
element in GF($2^m$). By the normal basis theorem [9], there is at
least one normal basis in GF($2^m$). Let us pick a normal basis
$\{\gamma^{2^0}, \gamma^{2^1}, \ldots, \gamma^{2^{m-1}}\}$ in
GF($2^m$). Each element in GF($2^m$) has an $m$-bit binary vector
representation under this basis, i.e.,
$\alpha^j = \sum_{l=0}^{m-1} b_l\gamma^{2^l}$, and
$(b_{m-1} b_{m-2} \cdots b_0)$ is the vector representation of
$\alpha^j$.

It is easy to see that the vector representation of $\alpha^{2j}$ is
just a left cyclic shift of that of $\alpha^j$. Therefore, if an
integer $j$ is in the cyclotomic coset $C_i$, the vector
representation of $\alpha^j$ repeats itself after $m_i$ shifts, where
$m_i$ is the size of $C_i$. If $m_i < m$, then $m_i \mid m$, and the
vector representation of $\alpha^j$ can be partitioned into
$\frac{m}{m_i}$ blocks, all of which are identical and have the same
size $m_i$; otherwise it cannot repeat itself after $m_i$ cyclic
shifts. Therefore, there are at most $(2^{m_i} - 1)/m_i$ cyclotomic
cosets with size $m_i$. □
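Lemma 1 can be spot-checked by counting cosets of each size for small $m$. This is a minimal sketch with illustrative helper names; it tests the integer form of the bound, count $\cdot\, m_i \le 2^{m_i} - 1$, to avoid floating-point division.

```python
from collections import Counter


def coset_size_counts(m):
    """Map each coset size to the number of cosets modulo 2^m - 1."""
    n = (1 << m) - 1
    seen = [False] * n
    counts = Counter()
    for s in range(n):
        if seen[s]:
            continue
        size, x = 0, s
        while not seen[x]:
            seen[x] = True
            size += 1
            x = (2 * x) % n
        counts[size] += 1
    return counts


def lemma1_holds(m):
    """Every size divides m, and count * size <= 2^size - 1 for each size."""
    return all(m % size == 0 and count * size <= (1 << size) - 1
               for size, count in coset_size_counts(m).items())
```

For $m = 4$ the counts are one coset of size 1, one of size 2 and three of size 4, against bounds of 1, 1.5 and 3.75.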

Lemma 2. $2^m - 1 < km < 2(2^m - 1)$, where $m$ is a positive integer
and $k$ is the number of the cyclotomic cosets of
$\{0, 1, \ldots, 2^m - 2\}$ modulo $2^m - 1$ with respect to two.

Proof. The lower bound on $km$ comes from the fact that $m$ is the
maximum cyclotomic coset size. It suffices to prove the upper bound
on $km$.

Without loss of generality, we assume that the group $G_0$ contains
the cosets with a size of $m$, and the other groups contain the cosets
with sizes less than $m$. Since any coset size smaller than $m$ is a
proper divisor of $m$ and hence at most $\lfloor m/2 \rfloor$, by
Lemma 1 we have

$$km = |G_0|\,m + \sum_{j=1}^{d-1} |G_j|\,m$$
$$\le (2^m - 1) + \sum_{m_i = 1,\ m_i \mid m}^{\lfloor m/2 \rfloor} \frac{2^{m_i} - 1}{m_i}\,m$$
$$\le (2^m - 1) + m\sum_{m_i = 1}^{\lfloor m/2 \rfloor} \left(2^{m_i} - 1\right)$$
$$< (2^m - 1) + m\left(2^{\lfloor m/2 \rfloor + 1} - 2\right). \qquad (2)$$

Consider the function $f(x) = x - 2^{\frac{x-2}{2}}$. It is easy to
check that $f'(x) = 1 - (0.5\ln 2)\,2^{\frac{x-2}{2}} < 0$ when
$x \ge 8$, which means $f(x)$ is strictly decreasing when $x \ge 8$.
We can also check that $f(9) < 0$, and hence $f(x) \le f(9) < 0$ when
$x \ge 9$. Therefore, for an integer $m \ge 9$, we have $f(m) < 0$,
and $m < 2^{\frac{m-2}{2}}$. Substituting this in (2), we have
$km < (2^m - 1) + (2^m - 2^{\frac{m}{2}}) < 2(2^m - 1)$ when
$m \ge 10$. For $m \le 9$, the lemma can be verified by inspection. □
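The "verified by inspection" step can be reproduced with a short enumeration. This is a minimal sketch with illustrative names; it checks $2^m - 1 < km < 2(2^m - 1)$ for $m \ge 2$ (for $m = 1$ there is a single coset and $km = 2^m - 1$, so the strict lower bound needs $m \ge 2$).

```python
def count_cosets(m):
    """Number k of cyclotomic cosets modulo 2^m - 1 with respect to two."""
    n = (1 << m) - 1
    seen = [False] * n
    k = 0
    for s in range(n):
        if seen[s]:
            continue
        k += 1
        x = s
        while not seen[x]:
            seen[x] = True
            x = (2 * x) % n
    return k


def lemma2_holds(m):
    """Check 2^m - 1 < k*m < 2*(2^m - 1)."""
    n = (1 << m) - 1
    km = count_cosets(m) * m
    return n < km < 2 * n
```

For example, $m = 4$ gives $k = 5$ and $km = 20$, which lies between 15 and 30.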

We have shown that the total computational complexity of evaluating
$\mathbf{B}\mathbf{u}$ is $O(m^{\log_2 3}k^2/\log_2 k)$ additions,
hence there exists a constant $c$ independent of $m$ and $k$ such that
the total computational complexity is less than
$c\,m^{\log_2 3}k^2/\log_2 k$ additions. By Lemma 2 we have

$$c\,m^{\log_2 3}\,\frac{k^2}{\log_2 k} < c\,m^{\log_2 3}\,\frac{4(2^m - 1)^2}{m^2\left(m + \log_2\frac{1 - 2^{-m}}{m}\right)}. \qquad (3)$$

Consider the function $f(x) = 2^x - 2^{-x} - 2x$. We can show that
$f'(x) = (2^x + 2^{-x})\ln 2 - 2 > 0$ when $x \ge 2$, and $f(3) > 0$.
Therefore, $f(x) \ge f(3) > 0$ when $x \ge 3$, which implies
$2^x - 2^{-x} > 2x$. Then we can show that

$$m + \log_2\frac{1 - 2^{-m}}{m} = \log_2 2^{\frac{m}{2}} + \log_2\frac{2^{\frac{m}{2}} - 2^{-\frac{m}{2}}}{m} > \frac{m}{2}$$

when $m \ge 6$. Since we are considering the asymptotic complexity,
the cases when $m < 6$ do not need to be considered. Substituting this
result into (3), we have

$$c\,m^{\log_2 3}\,\frac{k^2}{\log_2 k} < c\,m^{\log_2 3}\,\frac{8(2^m - 1)^2}{m^3} = 8c\,\frac{(2^m - 1)^2}{m^{\log_2\frac{8}{3}}}.$$

Since $n = 2^m - 1$, the additive complexity of $\mathbf{B}\mathbf{u}$
as well as $\mathbf{A}\mathbf{v}$ is upper bounded by
$O(n^2/(\log_2 n)^{\log_2\frac{8}{3}})$, which is lower than
$O(n^2/\log_2 n)$, the additive complexity of the multiplication
between an arbitrary $n \times n$ binary matrix and a vector.
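As a quick sanity check on the exponent, the following worked step shows where $\log_2\frac{8}{3}$ comes from. It is only a sketch: it replaces $k$ by $n/m$ and $\log_2 k$ by $m$, which Lemma 2 and the bound $\log_2 k > m/2$ justify up to constant factors.

```latex
\begin{align*}
m^{\log_2 3}\,\frac{k^2}{\log_2 k}
  \;\approx\; m^{\log_2 3}\cdot\frac{(n/m)^2}{m}
  \;=\; \frac{n^2}{m^{\,3-\log_2 3}}
  \;=\; \frac{n^2}{m^{\,\log_2 8-\log_2 3}}
  \;=\; \frac{n^2}{m^{\,\log_2\frac{8}{3}}},
  \qquad \log_2\tfrac{8}{3}\approx 1.415 .
\end{align*}
```

Since the exponent exceeds 1, the bound is asymptotically smaller than the $n^2/\log_2 n$ cost of a generic binary matrix vector product.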

3.3. Discussions 

To evaluate the tightness of our asymptotic bounds, in Fig. 1 we
compare our bounds with the actual multiplicative and additive
complexities of CFFTs in [5]. In Fig. 1 we scale our bounds so that
they match the actual complexities when $n = 1023$. From Fig. 1 we can
see that our bounds on the additive and multiplicative complexities
are rather tight. The solid curves corresponding to the actual
complexities are very close to the dashed curves corresponding to the
theoretical bounds. Therefore, the actual additive and multiplicative
complexities are on the order of
$O(n^2/(\log_2 n)^{\log_2\frac{8}{3}})$ and
$O(n(\log_2 n)^{\log_2\frac{3}{2}})$, respectively. We remark that
since we have scaled the theoretical bounds to match the actual
complexity at certain points, the actual computational complexity is
not necessarily strictly smaller than the scaled theoretical bound.

Fig. 1. Comparison of the actual complexities and our bounds.

We then compare the asymptotic bounds on the complexities of CFFTs
with those of other algorithms in the literature. In [6], a fast DFT
algorithm is proposed for GF($2^m$), where $m$ can be any positive
integer. Both the additive and multiplicative complexities of this
algorithm are $O(n(\log n)^2)$. When $m$ is a power of two, more
efficient algorithms have been proposed. For example, Gao's algorithm
in [13] has both additive and multiplicative complexities of order
$O(n\log n\log\log n)$, and this result is improved by Mateer's
algorithm [8], which reduces the multiplicative complexity to
$O(n\log n)$. When the length of the DFT $n = s^r$ is a power of some
integer $s$, [7] introduces a fast DFT algorithm that has a
computational complexity of $rn(s-1)$. Note that this algorithm works
for arbitrary algebras rather than finite fields.

Tab. 1 summarizes the asymptotic computational complexities of the
aforementioned algorithms when we apply them to DFTs with lengths of
$2^m - 1$ over GF($2^m$). To compare the total complexities of these
algorithms, we define the total complexity to be a weighted sum of the
additive and multiplicative complexities, and assume that one
multiplication over GF($2^m$) has the same complexity as $2m - 1$
additions over the same field. That is, the total complexity is given
by total $= (2m - 1)\,\cdot$ multiplicative $+$ additive. We note that
this assumption comes from both hardware and software considerations
[5]. Since we focus on $(2^m - 1)$-point DFTs, we have that
$m = \log_2(n + 1)$ and $2m - 1$ is of order $O(\log n)$.

Table 1. Asymptotic complexities of $(2^m - 1)$-point DFT algorithms and their respective restrictions. All logarithms are base two.

| Alg. | Restriction: fields | Restriction: lengths ($n$) | Multiplicative | Additive | Total |
|---|---|---|---|---|---|
| Wang [6] | GF($2^m$), $m$ arbitrary | $2^m - 1$ | $O(n(\log n)^2)$ | $O(n(\log n)^2)$ | $O(n(\log n)^3)$ |
| Cantor [7] | GF($2^m$), $m$ arbitrary | $2^m - 1$ | $O(n^2)$ | $O(n^2)$ | $O(n^2\log n)$ |
| Gao [13] | GF($2^m$), $m = 2^K$ | $2^m - 1$ | $O(n\log n\log\log n)$ | $O(n\log n\log\log n)$ | $O(n(\log n)^2\log\log n)$ |
| Mateer [8] | GF($2^m$), $m = 2^K$ | $2^m - 1$ | $O(n\log n)$ | $O(n\log n\log\log n)$ | $O(n(\log n)^2)$ |
| CFFTs | GF($2^m$), $m$ arbitrary | $2^m - 1$ | $O(n(\log n)^{\log_2\frac{3}{2}})$ | $O(n^2/(\log n)^{\log_2\frac{8}{3}})$ | $O(n^2/(\log n)^{\log_2\frac{8}{3}})$ |
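The weighting used for the total complexity can be written as a one-line helper. This is a minimal sketch; the function name and the sample operation counts in the usage note are illustrative placeholders, not figures from the paper.

```python
def total_complexity(m, multiplicative, additive):
    """Weighted total over GF(2^m): one multiplication counts as 2m - 1 additions."""
    return (2 * m - 1) * multiplicative + additive
```

For instance, with $m = 8$, a hypothetical algorithm using 10 multiplications and 100 additions would be assigned a total of $15 \cdot 10 + 100 = 250$ addition-equivalents.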

From Tab. 1, CFFTs have the lowest multiplicative complexities among
all algorithms. Furthermore, as shown in Fig. 1(b), our asymptotic
bound on the multiplicative complexities of CFFTs is loose. These
results confirm the advantage of CFFTs in the multiplicative
complexities. On the other hand, due to their high additive
complexities, the additive and overall complexities of CFFTs are
asymptotically suboptimal. We emphasize the different assumptions for
the different DFT algorithms in Tab. 1. For all DFT algorithms, it is
assumed that the length $n$ and the size of the underlying field are
such that a DFT is well-defined. CFFTs and the fast DFT algorithm in
[6] have no additional assumptions. In contrast, the other three
algorithms in Tab. 1 all have additional constraints. First, Cantor's
algorithm [7] requires $n = s^r$, which is often difficult to satisfy.
When $n = 2^m - 1$ (and for other values of $n$), the only way to
satisfy this condition is the trivial one, $n = n^1$, due to
Mihailescu's Theorem [14]. When $r = 1$, Cantor's algorithm has
quadratic additive and multiplicative complexities and does not have
any computational advantage. Furthermore, the algorithms in [8] work
only in a field GF($2^m$) with $m = 2^K$.

As shown in [5], CFFTs have lower overall complexities than all other
DFT algorithms for most lengths up to thousands of symbols over
GF($2^m$) with $m \le 12$. The only exception is the 255-point DFT
over GF($2^8$), for which the overall complexity of Mateer's algorithm
is roughly 4% smaller than that of a 255-point CFFT. Although the
overall complexities of CFFTs are asymptotically suboptimal, CFFTs
remain very significant since they have the smallest overall
complexities for most practical lengths.

4. HARDWARE IMPLEMENTATIONS OF CFFTS

The architecture of a $(2^m - 1)$-point CFFT over GF($2^m$) is shown
in Fig. 2. First, the input vector $\mathbf{f}$ is reordered and
divided into $k$ sub-vectors according to the cyclotomic cosets. Then
we perform an $m_i$-point cyclic convolution between each sub-vector
$\mathbf{f}_i'$ and its corresponding pre-computed vector
$\mathbf{h}_i$, as described in Sec. 2. The cyclic convolution results
then go through the addition network to compute
$\mathbf{A}\mathbf{v}$. In Fig. 2, the reordering module can be
realized by wiring only, and the cyclic convolution modules can often
be reused since most of the convolutions are of the same size. As the
computation of $\mathbf{A}\mathbf{v}$ accounts for the majority of the
total computational complexity, the addition network in Fig. 2
requires significant area and power in hardware implementations, which
makes it difficult to implement CFFTs in hardware. Though the additive
complexity of the addition network can be reduced by techniques such
as the CSE algorithm in [5], the resulting addition network lacks
structure and hence is difficult to implement in hardware.

Fig. 2. Implementation diagram of $(2^m - 1)$-point CFFTs.

Fig. 3. Our circuitry for the addition network in Fig. 2. The
reordering module has $k$ inputs and $m$ outputs, and the pre-addition
module outputs $O(m^{\log_2 3})$ $k$-dimensional vectors.



Our additive complexity analysis in Sec. 3.2 leads to a structured
addition network, which can be implemented by the architecture shown
in Fig. 3. The vector $\mathbf{v}$, which holds the cyclic convolution
results, is first divided into $k$ sub-vectors according to the sizes
of the cyclotomic cosets, and then each sub-vector is extended to an
$m$-dimensional vector by padding zeros. The $k$ $m$-dimensional
vectors are then reordered into $m$ $k$-dimensional vectors according
to (1). Since the addition network follows the bilinear algorithm of
an $m \times m$ cyclic convolution, the pre- and post-additions in the
bilinear algorithm correspond to the pre-addition and post-addition
modules in Fig. 3, respectively, and the multiplications in the
bilinear algorithm correspond to the $k \times k$ matrix vector
product (MVP) modules, which compute the product between a
$k \times k$ binary matrix and a $k$-dimensional vector and can be
realized using additions only.

Fig. 4. Our circuitry for $k \times k$ MVP modules using the
Four-Russian algorithm. LUT stands for "look-up table" and
$s = \lceil\log_2 k\rceil$.

The padding and reordering modules in Fig. 3 do not require any
logic, and the pre-addition and post-addition modules have a much
smaller complexity than the $k \times k$ MVP modules, as shown in
Sec. 3. Therefore, the $k \times k$ MVP modules are the primary source
of the additive complexity of computing $\mathbf{A}\mathbf{v}$.
To achieve high throughput, we can implement those k x k 
modules in parallel. Furthermore, the CSE algorithm can reduce the
additive complexity of each module. Since $k$ is considerably smaller
than $n$, the CSE algorithm is more effective in simplifying a
$k \times k$ MVP than an $n \times n$ one. However,
as each module corresponds to a different k x k matrix, the 
CSE reduction results are different, and so are the addition 
networks for each k x k MVP module. Therefore, those mod- 
ules must be implemented separately, and we cannot save any 
chip area by implementing the circuitry in a serial or partly 
parallel fashion. 

To save area and power, the Four-Russian algorithm [11] can be used
to implement the $k \times k$ modules in these cases, using the
architecture shown in Fig. 4. According to the algorithm, the
circuitry has three stages to compute a $k \times k$ MVP, denoted as
$\mathbf{M}\mathbf{x}$. The first stage splits $\mathbf{x}$ into
$s = \lceil\log_2 k\rceil$ sub-vectors $\mathbf{x}_i$ and computes all
the binary combinations of the elements in each $\mathbf{x}_i$. The
second stage partitions the matrix $\mathbf{M}$ into $1 \times s$
sub-matrices $\mathbf{M}_i$ accordingly, and computes
$\mathbf{M}_i\mathbf{x}_i$ by look-up tables generated from the first
stage. Finally, the third stage sums up all the
$\mathbf{M}_i\mathbf{x}_i$. Since the structure of the first and last
stages in Fig. 4 does not depend on $\mathbf{x}$ or $\mathbf{M}$, they
can be reused in a serial or partly parallel implementation to save
chip area. The second stage depends on $\mathbf{M}$, but it still has
a regular structure that is favorable in hardware implementations. No
memory or registers are needed in the fully parallel implementation,
whereas buffers that hold the intermediate results are required in the
serial and partly parallel implementations.
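The three stages can be sketched in software as follows. This is an illustrative model, not the circuitry of Fig. 4: it assumes the common formulation of the Four-Russian method in which $\mathbf{x}$ is cut into consecutive chunks of at most $\lceil\log_2 k\rceil$ entries (the partition used in the text may differ), represents field elements as integers with XOR as addition, and uses a hypothetical function name.

```python
from math import ceil, log2


def four_russians_mvp(matrix, x):
    """Product of a binary matrix and a vector, with XOR as the addition.

    Assumes len(x) >= 1 and that every row of `matrix` has len(x) 0/1 entries.
    """
    k = len(x)
    chunk_len = max(1, ceil(log2(k)))
    result = [0] * len(matrix)
    for start in range(0, k, chunk_len):
        chunk = x[start:start + chunk_len]
        width = len(chunk)
        # Stage 1: table[mask] is the XOR of chunk[b] over the set bits b of
        # mask; each entry costs one XOR on top of a previously built entry.
        table = [0] * (1 << width)
        for mask in range(1, 1 << width):
            low = mask & -mask
            table[mask] = table[mask ^ low] ^ chunk[low.bit_length() - 1]
        for r, row in enumerate(matrix):
            # Stage 2: the row bits in this chunk select a table entry.
            mask = 0
            for b in range(width):
                if row[start + b]:
                    mask |= 1 << b
            # Stage 3: accumulate the partial products across chunks.
            result[r] ^= table[mask]
    return result
```

In this model the tables of the first stage are shared by all rows of the matrix, which is the source of the saving over a row-by-row evaluation.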

In addition to their low complexity, the modular structures of the
architectures in Fig. 3 and Fig. 4 are suitable for hardware
implementations. First, it is easy to apply architectural techniques
such as pipelining to these architectures for better clock rate and
throughput. Second, since the $k \times k$ MVP modules account for the
majority of the complexity, the modular structure provides various
tradeoff options between area, power, and throughput via reusing the
$k \times k$ MVP modules in Fig. 3 as well as the combination and
select modules in Fig. 4.